{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6e31064a-0696-4186-8973-f8ed3e4a1717",
   "metadata": {},
   "source": [
    "# Using Various SelectionScheme Classes\n",
    "## Objective\n",
    "\n",
    "The objective of this tutorial is to learn how to fully utilize `set_method()` and learn different SelectionSchemes that are preset in QMzyme. SelectionSchemes is an abstract base class designed to assist users in constructing chemically meaningful QMzymeRegion objects. Regions may be defined using the MDAnalysis selection string syntax, or predefined SelectionScheme subclasses. Importantly, selection metadata are stored in the QMzymeRegion creation_params attribute, allowing users to trace how a region was generated and improving method transparency and reproducibility. The broader goal of the SelectionScheme module is to provide systematic and reproducible approaches for QM region selection in enzyme active site studies.\n",
    "\n",
    "This workflow allows you to:\n",
    "\n",
    "- Learn and utilize various SelectionSchemes that are present in QMzyme.\n",
    "\n",
    "In this specific example, we are using ketosteroid isomerase (KSI) as the model system. The structure for KSI is obtained from the PDB [1OH0](https://doi.org/10.2210/pdb1OH0/pdb) and MM-minimized prior to this tutorial.\n",
    "\n",
    "## Classes used in this example\n",
    "\n",
    "- [GenerateModel](https://qmzyme.readthedocs.io/en/latest/API/QMzyme.GenerateModel.html)\n",
    "- [QM_Method](https://qmzyme.readthedocs.io/en/latest/API/QMzyme.CalculateModel.html#qm-treatment)\n",
    "- [CA_terminal TruncationScheme](https://qmzyme.readthedocs.io/en/latest/API/QMzyme.TruncationSchemes.html#QMzyme.TruncationSchemes.CA_terminal)\n",
    "- [DistanceCutoff SelectionScheme](https://qmzyme.readthedocs.io/en/latest/API/QMzyme.SelectionSchemes.html#QMzyme.SelectionSchemes.DistanceCutoff)\n",
    "- [ChargeShiftAnalysis SelectionScheme](https://qmzyme.readthedocs.io/en/latest/API/QMzyme.SelectionSchemes.html#QMzyme.SelectionSchemes.ChargeShiftAnalysis)\n",
    "\n",
    "## Required Files\n",
    "\n",
    "To start, you will need:\n",
    "\n",
    "- A fully prepped and protonated PDB\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d59c38bf-e748-45e7-af8e-d81b000fb82d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here are the necesary imports for this tutorial!\n",
    "\n",
    "import QMzyme \n",
    "import MDAnalysis\n",
    "from QMzyme.data import PDB, CSA_pkl, CSA_holo, CSA_apo\n",
    "from QMzyme.SelectionSchemes import DistanceCutoff, ChargeShiftAnalysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6f45fd6-ac4f-4d6b-a42f-eef55fc28283",
   "metadata": {},
   "source": [
    "## Manual Region Selection\n",
    "\n",
    "First, we can directly select the residues of our interest as our QM region. In here, the user needs to define which specific amino acid residues are included in the QM region with `set_region(selection=)`. Be mindful that when selecting a region manually, you need to specify the name of the region. The selection nomenclature is derived from MDAnalysis.<sup>1,2</sup> To select multiple residues, we use \"resid A or resid B.\" As odd as this may seem, this translates to \"select atoms that are in residue A or residue B,\" which will ultimately result in the selection of these two amino acids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "599aa4fd-da0b-4f70-ae59-226713ff7244",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Charge information not present. QMzyme will try to guess region charges based on residue names consistent with AMBER naming conventions (i.e., aspartate: ASP --> Charge: -1, aspartic acid: ASH --> Charge: 0.). See QMzyme.data.residue_charges for the full set.\n",
      "\n",
      "\tNonconventional Residues Found\n",
      "\t------------------------------\n",
      "\tEQU --> Charge: UNK, defaulting to 0\n",
      "\n",
      "You can update charge information for nonconventional residues by running \n",
      "\t>>>QMzyme.data.residue_charges.update({'3LETTER_RESNAME':INTEGER_CHARGE}). \n",
      "Note your changes will not be stored after you exit your session. It is recommended to only alter the residue_charges dictionary. If you alter the protein_residues dictionary instead that could cause unintended bugs in other modules (TruncationSchemes).\n",
      "\n",
      "[<QMzymeRegion region contains 83 atom(s) and 4 residue(s)>]\n"
     ]
    }
   ],
   "source": [
    "model = QMzyme.GenerateModel(PDB)\n",
    "QMzyme.data.residue_charges.update({'EQU': -1})\n",
    "\n",
    "model.set_region(selection=\"resid 263 or resid 16 or resid 40 or resid 103\", name=\"region\")\n",
    "\n",
    "print(model.regions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac6426a8-96b4-4182-a6ac-9d52a6ff29e8",
   "metadata": {},
   "source": [
    "You can also select an atom group using `select_atoms()` from the MDAnalysis Universe and create a region of it.<sup>1,2</sup>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "17045d8a-766a-47c0-bafd-4fb7345f1be5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[<QMzymeRegion region contains 83 atom(s) and 4 residue(s)>, <QMzymeRegion MDA_selection contains 275 atom(s) and 19 residue(s)>]\n"
     ]
    }
   ],
   "source": [
    "u = MDAnalysis.Universe(PDB)\n",
    "atom_group = u.select_atoms('(byres around 3 resid 263) or resid 263')\n",
    "model.set_region(selection=atom_group, name=\"MDA_selection\")\n",
    "print(model.regions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "105a8097-155f-4798-8edf-5b9756acb98e",
   "metadata": {},
   "source": [
    "## Distance Cutoff\n",
    "\n",
    "The `DistanceCutoff` class selects residues based on their distance from a user-defined `set_catalytic_center()`. Specifically, the workflow identifies atoms within a specified cutoff distance of the catalytic center and determines which residues should be included in the resulting region. Users may choose whether the selection includes full-residue or partial-residue selection based only on atoms satisfying the cutoff criterion. Distance-based selection provides a straightforward method for constructing QM regions. However, while the distance cutoff scheme is conceptually simple, this ad-hoc approach often requires large QM regions in systematic convergence studies.<sup>3</sup>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e0356d8c-3dae-4275-88e3-dcf01534f903",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Charge information not present. QMzyme will try to guess region charges based on residue names consistent with AMBER naming conventions (i.e., aspartate: ASP --> Charge: -1, aspartic acid: ASH --> Charge: 0.). See QMzyme.data.residue_charges for the full set.\n",
      "[<QMzymeRegion catalytic_center contains 37 atom(s) and 1 residue(s)>, <QMzymeRegion cutoff_3 contains 275 atom(s) and 19 residue(s)>]\n"
     ]
    }
   ],
   "source": [
    "model = QMzyme.GenerateModel(PDB)\n",
    "QMzyme.data.residue_charges.update({'EQU': -1})\n",
    "\n",
    "model.set_catalytic_center(selection='resid 263')\n",
    "model.set_region(selection=DistanceCutoff, cutoff=3)\n",
    "\n",
    "print(model.regions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b42e56a-8738-48a7-88dd-f19f1fe86be5",
   "metadata": {},
   "source": [
    "## ChargeShiftAnalysis\n",
    "The `ChargeShiftAnalysis` class implements Charge Shift Analysis (CSA) for region residue selection.<sup>3,4</sup> CSA identifies catalytically important residues by analyzing differences in partial atomic charges between holo and apo forms of the biomolecular system. The workflow in QMzyme proceeds in two stages: \n",
    "\n",
    "1) An initial large-region selection (approximately 900-1000 atoms) is generated for both holo and apo systems using a distance-based cutoff (default range approximately 9-10 Å from the catalytic center). Quantum mechanical calculations are performed on both systems.  \n",
    "\n",
    "2) Partial charge differences are analyzed to identify residues that significantly perturb the electronic structure of the catalytic center.  \n",
    "\n",
    "The default selection size is based on recommendations from CSA development. Once the QM calculations are completed, the output files are supplied back into the CSA workflow to create the QMzymeRegion.\n",
    "\n",
    "The first iteration of ChargeShiftAnalysis requires users to undergo `set_catalytic_center()` and `QM_Method()`. This will generate two QM input files for frequency calculation. For site-directed mutagenesis to alanine, user can define specific protein residues in `alanine_mutation=`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6535aa18-88bb-439f-91b5-598165f1955c",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = QMzyme.GenerateModel(PDB)\n",
    "qm_method = QMzyme.QM_Method(\n",
    "    basis_set='6-31G*',\n",
    "    functional='wB97XD',\n",
    "    qm_input='pop=hirshfeld',\n",
    "    program='gaussian'\n",
    ")\n",
    "\n",
    "model.set_catalytic_center('resid 263')\n",
    "model.set_region(selection=ChargeShiftAnalysis, name=None, min_atoms=950, max_atoms=1050,\n",
    "    method=qm_method, alanine_mutation=None)\n",
    "\n",
    "print(model.regions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "894def89-06ee-4846-8445-d48bfe4541a5",
   "metadata": {},
   "source": [
    "After running QM calculations on these two files, partial charge difference is calculated and assigned to QMzymeAtoms. This is used to calculate the partial charge difference between the residues. Then,`ChargeShiftAnalysis` SelectionScheme will create a region based on the residues that have a higher partial charge difference than the user-defined `charge_threshold=`. To get a .csv file containing the partial charge calculations, you can set `charge_output_csv=True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8c2cac81-e272-44f5-a54f-54ab82c2ef43",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calculating partial charge differences from provided QM output files.\n",
      "Successfully retrieved CSA_holo_truncated, CSA_apo_truncated from the QMzymeModel.\n",
      "Use of this selection scheme requires citing the following reference(s): \n",
      " \t(1) Kulik, Heather J.; Zhang, Jianyu; Klinman, Judith P.; Martinez, Todd J.(2016) How Large Should the QM Region Be in QM/MM Calculations? The Case of Catechol O-Methyltransferase. Journal of Physical Chemistry B, 120(44). and (2) Karelina, M., & Kulik, H. J. (2017). Systematic quantum mechanical region determination in QM/MM simulation. Journal of chemical theory and computation, 13(2).\n",
      "QMzymeRegion CSA_cutoff_region has an estimated charge of -1.\n",
      "[<QMzymeRegion catalytic_center contains 37 atom(s) and 1 residue(s)>, <QMzymeRegion CSA_holo contains 1008 atom(s) and 88 residue(s)>, <QMzymeRegion CSA_holo_truncated contains 982 atom(s) and 88 residue(s)>, <QMzymeRegion CSA_apo contains 971 atom(s) and 87 residue(s)>, <QMzymeRegion CSA_apo_truncated contains 945 atom(s) and 87 residue(s)>, <QMzymeRegion CSA_cutoff_region contains 77 atom(s) and 5 residue(s)>]\n"
     ]
    }
   ],
   "source": [
    "model = QMzyme.GenerateModel(pickle_file=CSA_pkl)\n",
    "qm_method = QMzyme.QM_Method(\n",
    "    basis_set='6-31G*', \n",
    "    functional='wB97X-D3', \n",
    "    qm_input='OPT FREQ', \n",
    "    program='orca'\n",
    ")\n",
    "\n",
    "model.set_region(selection=ChargeShiftAnalysis, method=qm_method, holo_output_files=CSA_holo,\n",
    "    apo_output_files=CSA_apo, pop=\"cm5\", charge_threshold=0.05)\n",
    "\n",
    "print(model.regions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30885afd-d4b8-4933-aced-0f4fc7878755",
   "metadata": {},
   "source": [
    "## References Cited\n",
    "\n",
    "1) N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations. J. Comput. Chem. 32 (2011), 2319–2327. doi:10.1002/jcc.21787\n",
    "2) R. J. Gowers, M. Linke, J. Barnoud, T. J. E. Reddy, M. N. Melo, S. L. Seyler, D. L. Dotson, J. Domanski, S. Buchoux, I. M. Kenney, and O. Beckstein. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In S. Benthall and S. Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 98-105, Austin, TX, 2016. SciPy. doi:10.25080/Majora-629e541a-00e\n",
    "3) Kulik, Heather J.; Zhang, Jianyu; Klinman, Judith P.; Martinez, Todd J.(2016) How Large Should the QM Region Be in QM/MM Calculations? The Case of Catechol O-Methyltransferase. Journal of Physical Chemistry B, 120(44).\n",
    "4) Karelina, M., & Kulik, H. J. (2017). Systematic quantum mechanical region determination in QM/MM simulation. Journal of chemical theory and computation, 13(2))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}